11/30/2018

University of Arkansas Geosciences Colloquia

Statistical Modeling

  • Drawing conclusions based on data while accounting for random variation including sampling and observational errors.

  • Goal: Make inference about the state of the world using data.

  • Most statistics is taught as a recipe.

  • If your data are X then do Y.

  • Where is the creativity? Science?

Problem

  • Where is the science?

  • Don't we know something about the world beyond the fact that our data are X?

  • How do we add this knowledge into our modeling?

Scientifically Motivated Statistical Modeling

  • Probability model.

    • Model encodes our understanding of the scientific process of interest.

    • Model accounts for as much uncertainty as possible.

    • Model results in a probability distribution.


  • Update model with data.

    • Use the model to generate parameter estimates given data.

Scientifically Motivated Statistical Modeling

  • Criticize the model

    • Does the model fit the data well?

    • Do the predictions make sense?

    • Are there subsets of the data that don't fit the model well?


  • Make inference using the model.

    • If the model fits the data, use the model fit for prediction or inference.

Probability Distributions

  • Start with probability distributions:

    • Data \(\mathbf{y}\).

    • Parameters \(\boldsymbol{\theta}\).

    • \([\mathbf{y}]\) is the probability distribution of \(\mathbf{y}\).

    • \([\mathbf{y} | \boldsymbol{\theta}]\) is the conditional probability distribution of \(\mathbf{y}\) given parameters \(\boldsymbol{\theta}\).

Example: linear regression

\[ \begin{align*} \left[y_i | \boldsymbol{\theta} \right] & \sim \operatorname{N}(X_i \beta, \sigma^2) \\ \boldsymbol{\theta} & = (\beta, \sigma^2) \end{align*} \]
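As a minimal sketch of what the conditional distribution \([y_i | \boldsymbol{\theta}]\) means in practice, the snippet below simulates data from the regression model and evaluates the log-likelihood at two parameter values. The design matrix, the "true" parameter values, and the function names are illustrative assumptions, not part of the talk.

```python
import math
import random

def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def regression_loglik(y, X, beta, sigma2):
    """Sum over i of log [y_i | theta], with theta = (beta, sigma2)."""
    total = 0.0
    for y_i, x_i in zip(y, X):
        mean_i = sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))  # X_i beta
        total += normal_logpdf(y_i, mean_i, sigma2)
    return total

# Simulated data set: intercept plus one covariate (all values are made up).
rng = random.Random(1)
beta_true, sigma2_true = [1.0, 2.0], 0.25
X = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(200)]
y = [rng.gauss(beta_true[0] + beta_true[1] * x[1], math.sqrt(sigma2_true)) for x in X]

# The data are far more probable under the generating parameters than under beta = 0.
ll_true = regression_loglik(y, X, beta_true, sigma2_true)
ll_wrong = regression_loglik(y, X, [0.0, 0.0], sigma2_true)
```

Comparing `ll_true` with `ll_wrong` shows how the probability model ranks candidate parameter values given the data.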

Model Framework

  • Hierarchical model:

    • A model built in components.

    • Each component represents a different statistical goal.


  • Break the model into components:

    • Data Model.

    • Process Model.

    • Prior Model.


  • Combined, the data model, the process model, and the prior model define a posterior distribution.

Data Model

\[ {\huge \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} [\mathbf{z} | \boldsymbol{\theta}_P] [\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} } \]

Data model

\[ {\huge \begin{align*} \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} \end{align*} } \]

  • Describes how the data are collected and observed.
    • Account for measurement process and uncertainty.
    • Model the data in the manner in which they were collected.

  • Data \(\mathbf{y}\).
    • Noisy data.
    • Inexpensive data.
    • Not what you want to make inference on but close.

Data model

\[ {\huge \begin{align*} \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} \end{align*} } \]

  • Latent variables \(\mathbf{z}\).
    • Think of \(\mathbf{z}\) as the ideal data.
    • No measurement error - the exact quantity you want to observe but can't.

  • Data model parameters \(\boldsymbol{\theta}_D\).

Data model: Examples

\[ {\huge \begin{align*} \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} \end{align*} } \]

  • Age of minerals:

    • \(\mathbf{y}\) is the radio-date estimate.

    • \(\mathbf{z}\) is the true mineral age.

    • \(\theta_D\) is the radio-date standard error.

    • The probability distribution is determined by the measurement process.

Data model: Examples

\[ {\huge \begin{align*} \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} \end{align*} } \]

  • Reconstructing climate from tree rings

    • \(\mathbf{y}\) is the tree ring width increment.

    • \(\mathbf{z}\) is the true, unobserved climate variable.

    • \(\boldsymbol{\theta}_D\) models the relationship between climate, stand dynamics, individual heterogeneity, tree age, (etc.) and tree ring width.

    • The probability distribution is determined by tree physiology, measurement uncertainty, etc.

Process Model

\[ {\huge \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}] \color{blue}{[\mathbf{z} | \boldsymbol{\theta}_P]}[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} } \]

Process Model

\[ {\huge \begin{align*} \color{blue}{[\mathbf{z} | \boldsymbol{\theta}_P]} \end{align*} } \]

  • Where the science happens!

  • Latent process \(\mathbf{z}\) is modeled given process parameters \(\boldsymbol{\theta}_P\); the data \(\mathbf{y}\) enter only through the data model.
    • Can be dynamic in space and/or time

  • Process parameters \(\boldsymbol{\theta}_P\).

  • Virtually all interesting scientific questions can be answered with inference about \(\mathbf{z}\).

Process Model: Examples

\[ {\huge \begin{align*} \color{blue}{[\mathbf{z} | \boldsymbol{\theta}_P]} \end{align*} } \]

  • Sediment Mixing:
    • Model different mineral creation events.
    • Model mixing of sediments over time.
    • \(\mathbf{z}\) includes the true unobserved mineral age as well as the discrete mineral creation event.
    • \(\boldsymbol{\theta}_P\) includes the duration of the mineral creation event, the number of mineral creation events, and the relative mixing of rock to produce a sediment.

Process Model: Examples

\[ {\huge \begin{align*} \color{blue}{[\mathbf{z} | \boldsymbol{\theta}_P]} \end{align*} } \]

  • Reconstructing climate with tree rings

    • Trees of the same species share a similar response to climate.

    • Climate variables at nearby sites are, on average, more similar than climate variables at sites far apart.

    • Climate variables separated by short periods of time are more similar than climate variables separated by long periods of time.

    • \(\mathbf{z}\) is the value of the unobserved climate variables.

    • \(\boldsymbol{\theta}_P\) are the species-specific growth responses and the correlation of climate across time and space.
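One common way to encode "nearby sites and nearby times are more similar" is a correlation function that decays with separation. The sketch below assumes an exponential decay with a single range parameter; the functional form, the site positions, and the range value are illustrative assumptions rather than the model used in the talk.

```python
import math

def exp_correlation(distance, range_param):
    """Correlation that decays exponentially with separation (assumed form)."""
    return math.exp(-distance / range_param)

def correlation_matrix(locations, range_param):
    """Pairwise correlations for 1-D locations (e.g. km along a transect, or years)."""
    return [[exp_correlation(abs(a - b), range_param) for b in locations]
            for a in locations]

sites_km = [0.0, 10.0, 200.0]  # hypothetical site positions
R = correlation_matrix(sites_km, range_param=50.0)

# R[0][1] (sites 10 km apart) is much larger than R[0][2] (sites 200 km apart).
```

In a process model, the range parameter would be one element of \(\boldsymbol{\theta}_P\) and would be estimated rather than fixed.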

Prior Model

\[ {\huge \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}] [\mathbf{z} | \boldsymbol{\theta}_P] \color{orange}{[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P]} \end{align*} } \]

Prior Model

  • Probability distributions define "reasonable" ranges for parameters.

  • Prior models are useful for a variety of problems:
    • Choosing important variables.
    • Preventing overfitting (regularization).
    • "Pooling" estimates across categories.

Posterior Distribution

\[ {\huge \begin{align*} \color{cyan}{[\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}]} & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}] [\mathbf{z} | \boldsymbol{\theta}_P] [\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} } \]

Posterior distribution

\[ {\huge \begin{align*} \color{cyan}{[\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}]} \end{align*} } \]

  • Probability distribution over all unknowns in the model.

  • Inference is made using the posterior distribution.

  • Because the posterior distribution is a probability distribution, uncertainty is easy to calculate.
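A toy version of the hierarchy shows why uncertainty comes "for free". Assume a single radio-date \(y\) with known standard error (the data model), and a normal process model for the true age \(z\) with fixed mean and standard deviation. Under these assumptions the posterior of \(z\) is normal in closed form. The numbers are hypothetical.

```python
import math

def posterior_latent_age(y, se, prior_mean, prior_sd):
    """Posterior mean and sd of z when y | z ~ N(z, se^2) and z ~ N(prior_mean, prior_sd^2)."""
    precision = 1.0 / se**2 + 1.0 / prior_sd**2
    mean = (y / se**2 + prior_mean / prior_sd**2) / precision
    return mean, math.sqrt(1.0 / precision)

post_mean, post_sd = posterior_latent_age(y=105.0, se=5.0, prior_mean=100.0, prior_sd=10.0)

# A 95% credible interval follows directly from the posterior distribution.
lower, upper = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
```

In realistic models the posterior has no closed form and is approximated by samples, but interval and probability statements are read off the posterior in the same way.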

Example: Climate change

  • Climate change is well understood globally.

  • Climate change is less well understood locally.

  • Need for spatially explicit reconstructions of climate variables.

  • Problem: data sources are messy and noisy.

Local Prediction

Introduction

Predicting the future by learning from the past

  • Vegetation composition and structure change from ice age to current period.

  • Using change in temperature to predict future vegetation change.


Predicting the future by learning from the past

  • Classify compositional and structural change.

Predicting the future by learning from the past

  • Ordered multi-logistic B-spline regression.


  • Learn vegetation change in structure/composition given temperature change.


  • Forecast future vegetation change.



Predicting the future by learning from the past

  • Data model: Multi-logit distribution for ordered categories of observed change.

  • Process model: Assumes increasing temperature results in smooth changes of composition and structure.

  • Prior model: Not used.
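A minimal sketch of how an ordered multi-logit data model turns a predictor into category probabilities, assuming a cumulative-logit (proportional odds) form. The predictor `eta` stands in for the smooth B-spline function of temperature change, and the cutpoints are made-up values; the model fitted in the talk may be parameterized differently.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logit_probs(eta, cutpoints):
    """Probabilities of ordered categories, using P(Y <= c) = logistic(cutpoint_c - eta)."""
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # P(Y = c) = P(Y <= c) - P(Y <= c - 1)
        previous = cum
    return probs

cutpoints = [-1.0, 1.0]  # two cutpoints give three ordered categories (low, moderate, large change)
probs_small_warming = ordered_logit_probs(eta=-2.0, cutpoints=cutpoints)
probs_large_warming = ordered_logit_probs(eta=2.0, cutpoints=cutpoints)
```

As `eta` increases, probability mass moves from the lowest category toward the highest, which is the ordered behavior the data model needs.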

Modeling Sediment mixing

Sediment Mixing

Sharman and Johnstone (2017). Sediment unmixing using detrital geochronology. Earth and Planetary Science Letters.

Goals

  • Estimate proportion of each parent in a daughter.

  • Reconstruct unobserved parent distributions given daughters.

Data

Mixing Model: Estimate proportion of each parent in a daughter.

Data Model

  • Dating uncertainty.

  • \(y_{ib}\): age measurement on mineral \(i=1, \ldots, N_b\) for parent \(b = 1, \ldots, B\).

  • \(y_{id}\): age measurement for mineral \(i=1, \ldots, N\) of daughter \(d\).

  • Measurement error reported as standard deviation \(\sigma_{ib}\) (\(\sigma_{id}\)).

\[ \begin{align*} y_{ib} & \sim \operatorname{N}(z_{ib}, \sigma^2_{ib}). \\ y_{id} & \sim \operatorname{N}(z_{id}, \sigma^2_{id}). \end{align*} \]

Process Model


Assumptions:


  • An unknown number of mineral creation events that are relatively discrete in geologic time.

  • Each parent is a mixture of minerals from creation events.

  • Each daughter is a mixture of parents.

Process Model

  • For each parent \(b\), the proportion of minerals from each creation event is \(\mathbf{p}_b\) where \(p_{bk} \geq 0\) and \(\sum_{k=1}^\infty p_{bk} = 1\).

  • Most of the \(p_{bk}\) are 0 (reasonable in the real world).

  • Unknown number of mineral creation events.

Process Model: Parent Distribution

\[ \begin{align*} {z}_{ib} \sim \sum_{k=1}^\infty p_{bk} \operatorname{N}(\mu_k, \sigma^2_k). \end{align*} \]
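Because the dating error in the data model is normal and each creation event is normal, the latent age \(z_{ib}\) can be integrated out analytically. As a worked step (using \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}^2\) as shorthand for the collections of \(\mu_k\) and \(\sigma^2_k\), which is notation assumed here), the observed ages follow the same mixture with the measurement variance added to each component:

\[ \begin{align*} [y_{ib} | \mathbf{p}_b, \boldsymbol{\mu}, \boldsymbol{\sigma}^2] & = \int [y_{ib} | z_{ib}] [z_{ib} | \mathbf{p}_b, \boldsymbol{\mu}, \boldsymbol{\sigma}^2] \, dz_{ib} \\ & = \sum_{k=1}^\infty p_{bk} \operatorname{N}(y_{ib} | \mu_k, \sigma^2_k + \sigma^2_{ib}). \end{align*} \]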

Dirichlet Process

  • Assigns observations to clusters.

  • Model for \(\mathbf{p}_b\).

  • Number of clusters increases with number of observations.
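One standard way to generate Dirichlet process weights is the stick-breaking construction, sketched below with a finite truncation at `K` components. The concentration parameter `alpha` and the truncation level are assumptions introduced for illustration; the talk does not state how the model was implemented.

```python
import random

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights p_1, ..., p_K that sum to 1."""
    weights, remaining = [], 1.0
    for k in range(K):
        # Break off a Beta(1, alpha) fraction of what is left; the last piece takes the rest.
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def n_effective(weights, threshold=0.01):
    """Number of components carrying more than `threshold` of the total weight."""
    return sum(w > threshold for w in weights)

rng = random.Random(42)
few = stick_breaking(alpha=0.5, K=50, rng=rng)    # small alpha: weight concentrates on few events
many = stick_breaking(alpha=10.0, K=50, rng=rng)  # large alpha: weight spreads over many events
```

Most weights are close to zero in either case, which matches the idea that each parent contains minerals from only some creation events.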

Dirichlet Process

Many Creation Events

Dirichlet Process

Few Creation Events

Process Model - Mixing Model

  • Daughter is a mixture of parents.

\[ \begin{align*} z_{id} & \sim \sum_{b=1}^B \phi_b \sum_{k=1}^K p_{bk} \operatorname{N}(\mu_k, \sigma^2_k). \end{align*} \]

  • \(\phi_b\) is the proportion of daughter sediments from parent \(b\).
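The mixing model can be read as a two-step sampling scheme: pick a parent with probability \(\phi_b\), then pick a creation event within that parent with probability \(p_{bk}\), then draw an age from that event. The sketch below simulates latent daughter ages this way. The creation-event means, standard deviations, and parent compositions are hypothetical values chosen for illustration.

```python
import random

phi = [0.200, 0.532, 0.268]   # proportion of the daughter from each of B = 3 parents
mu = [100.0, 500.0, 1500.0]   # hypothetical creation-event mean ages (Ma)
sigma = [10.0, 30.0, 50.0]    # hypothetical creation-event standard deviations (Ma)
p = [                         # hypothetical p_bk: row b is parent b's mix of creation events
    [0.7, 0.3, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]

def sample_daughter_age(phi, p, mu, sigma, rng):
    """Draw one latent age z_id from the daughter mixture."""
    b = rng.choices(range(len(phi)), weights=phi)[0]   # which parent
    k = rng.choices(range(len(mu)), weights=p[b])[0]   # which creation event
    return rng.gauss(mu[k], sigma[k])

def mixture_mean(phi, p, mu):
    """Exact mean of the daughter mixture: sum_b phi_b sum_k p_bk mu_k."""
    return sum(phi_b * sum(p_bk * mu_k for p_bk, mu_k in zip(p_b, mu))
               for phi_b, p_b in zip(phi, p))

rng = random.Random(3)
z_daughter = [sample_daughter_age(phi, p, mu, sigma, rng) for _ in range(5000)]
```

Estimation runs this logic in reverse: given dated minerals from the parents and the daughter, infer the \(\phi_b\) that best explain the daughter's age distribution.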

Process Model - Mixing Model

\[ \begin{align*} \phi_1 = 0.200 \quad\quad\,\,\, \phi_2 = 0.532 \quad\quad\,\,\,\,\, \phi_3 = 0.268 \,\,\quad\quad\quad \mbox{Daughter} \end{align*} \]

Simulation Study: Mixing

Mixing Estimates

Unmixing Model: Estimate unobserved parent distributions.

Unmixing

Unmixing Model


\[ \begin{align*} y_{id} & \sim \operatorname{N}(z_{id}, \sigma^2_{id}). \end{align*} \]


\[ \begin{align*} z_{id} & \sim \sum_{b=1}^B \phi_{db} \sum_{k=1}^K p_{bk} \operatorname{N}(\mu_k, \sigma^2_k). \end{align*} \]

Simulation Study: Unmixing

Unmixing Reconstructions

Benefits:


  • Dirichlet process model can be used to describe sediment mixing processes.

  • Estimation of mixing and reconstruction with uncertainty.

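  • Can ask questions like: what is the probability that at least 50% of the sediment in daughter \(d\) comes from parent \(b\)? Estimate it as the fraction of the \(K\) posterior samples \(\phi_b^{(k)}\) that exceed 0.5:

\[ \begin{align*} \frac{1}{K} \sum_{k=1}^K I\{ \phi_b^{(k)} > 0.5 \}. \end{align*} \]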
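A posterior probability of this kind is estimated as the proportion of posterior draws for which the event holds. The sketch below uses simulated stand-in draws for \(\phi_b\) (an assumption for illustration only); in practice the draws would come from the fitted model's sampler output.

```python
import random

def posterior_probability(samples, threshold=0.5):
    """Fraction of posterior draws that exceed the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

# Stand-in draws for phi_b; these are NOT results from the talk.
rng = random.Random(0)
phi_draws = [rng.betavariate(12.0, 10.0) for _ in range(4000)]

prob_majority = posterior_probability(phi_draws)
```

The same function answers any threshold question (for example, at least 25% from parent \(b\)) by changing `threshold`.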

Future extensions:

  • Account for spatial correlation among daughters.

  • Account for temporal correlation within a sediment core.

Learning about the past: Climate Proxy Data

Climate proxy data

  • Many ecological and physical processes respond to climate over different time scales.
    • Tree rings, corals, forest landscapes, ice cores, lake levels, etc.


  • These processes are called climate proxies.
    • They are proxy measurements for unobserved climate.
    • Noisy and messy.
    • Respond to a wide variety of non-climatic signals.

Pollen Data
